http://forums.psy.ed.ac.uk/R/P01582/essential-12/essential-12.html

Frequency, contingency, counts

 

table                 Contingency table from count data
xtabs                 Contingency table from count data in frequency-weighted format
as.data.frame.table   Data frame in frequency-weighted format from a contingency table
ftable                "Flat" contingency table
prop.table            Scale table cells
addmargins            Add margins to a table
margin.table          Get marginal sums
chisq.test            Chi-squared contingency table tests and goodness-of-fit tests
fisher.test           Fisher's exact test for count data (2x2 and larger contingency tables)

 

# Data frame with some dummy count data
d1 = data.frame(
  g1 = factor(rep(1:2, c(73,47)), labels=c("yes","no")),
  g2 = factor(c(rep(1:3, c(21,19,33)), rep(1:3, c(16,12,19))), labels=c("low","med","high")),
  g3 = rep(gl(4,5, labels=LETTERS[1:4]), 6)
)

# Same data in frequency-weighted format
d2 = as.data.frame.table(table(d1))

names(d2) = c("f1","f2","f3","freq") # Make names of d2 different from d1 so we can attach both
attach(d1)
attach(d2)

Contingency tables

The table and xtabs functions construct contingency tables (cross-tabulations) of count data. The table function is designed for "long" format data with one row per observation. It takes one or more factors as arguments, or a data frame of factors, and returns a table whose cells are the counts at each combination of the factor levels. The xtabs function is designed for frequency-weighted data, with one row per combination of factor levels and a column of counts. It takes a formula with the frequency variable on the left and the factors on the right. The as.data.frame.table function is the inverse of xtabs: it takes a contingency table and returns a data frame in frequency-weighted format.

# 1-dimensional tables
table(g1)
xtabs(freq~f1)

# 2-dimensional tables
table(g1,g2)
xtabs(freq~f1+f2)

# 3-dimensional tables (using ftable to make "flat" tables)
ftable(table(g1,g2,g3), row.vars=1:2)
ftable(xtabs(freq~f1+f2+f3), row.vars=1:2)
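
The relationship between the two formats can be checked with a round trip. The following is a minimal sketch using a small made-up data frame (obs and its values are hypothetical, not the d1 data above), and it passes the data via the data argument of xtabs rather than relying on attach:

```r
# Hypothetical long-format data: one row per observation
obs <- data.frame(
  g1 = factor(c("yes", "yes", "yes", "no", "no", "yes", "no", "no")),
  g2 = factor(c("low", "high", "low", "low", "high", "high", "high", "low"))
)

tab  <- table(obs)                         # Counts from long-format data
fw   <- as.data.frame(tab)                 # Frequency-weighted format: columns g1, g2, Freq
tab2 <- xtabs(Freq ~ g1 + g2, data = fw)   # Back to a contingency table

all(tab == tab2)                           # Do the two tables hold the same counts?
```

Using data= keeps the example independent of whatever is currently attached to the search path.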

See also tapply and other aggregators for tables of aggregated data. For example, tables of group means:

x = runif(120)
tapply(x, list(g1,g2), mean) # Table of group means
ftable(tapply(x, list(g1,g2,g3), mean), row.vars=1:2) # Flat table of group means

Working with tables

Tables of proportions

The prop.table function takes a table (or matrix), scales the values in its cells, and returns the scaled table. The default scaling divides each cell by the sum total of the cells. For a contingency table this converts the count in each cell to a relative frequency. The optional margin argument sets the scale factor to a marginal sum (i.e. a row or column sum). If margin=1 each proportion is taken with respect to the sum of the corresponding row. If margin=2 each proportion is taken with respect to the sum of the corresponding column.

m = table(g1,g2)
prop.table(m) * 100 # Relative frequency scaled up to a percentage
prop.table(m,1) # Scale cells by the row sum
prop.table(m,2) # Scale cells by the column sum
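
As a quick check of what the margin argument does, the sketch below rebuilds the 2x3 table of counts by hand (the counts are those used to construct g1 and g2 above) and compares prop.table with an explicit division by the row sums using sweep:

```r
# Counts entered directly so the example does not depend on attached data
m <- as.table(matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
                     dimnames = list(g1 = c("yes", "no"),
                                     g2 = c("low", "med", "high"))))

p_all <- prop.table(m)      # Each cell divided by the grand total
p_row <- prop.table(m, 1)   # Each cell divided by its row sum
p_col <- prop.table(m, 2)   # Each cell divided by its column sum

sum(p_all)                  # Should be 1, up to rounding
rowSums(p_row)              # Each row should sum to 1
colSums(p_col)              # Each column should sum to 1

# The row scaling written out by hand
by_hand <- sweep(m, 1, rowSums(m), "/")
all.equal(p_row, by_hand, check.attributes = FALSE)
```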

Marginal summaries

The margin.table function takes a table (or matrix) and returns the sum total or the marginal sums. See also the functions rowSums, colSums, rowMeans and colMeans, and rowsum (which computes column sums within groups of rows).

margin.table(m) # Sum total
margin.table(m, 1) # Row sums
margin.table(m, 2) # Column sums

The addmargins function takes a table (or matrix) and returns it with additional margins containing the marginal sums. The optional FUN argument can be used to pass a summary function.

addmargins(m) # Marginal sums
addmargins(m, margin=1) # Margin containing column sums
addmargins(m, margin=2) # Margin containing row sums
addmargins(m, FUN=mean) # Marginal means 
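
To see how the marginal functions relate to each other, the following sketch rebuilds the same 2x3 table of counts by hand and compares margin.table with rowSums and with the margins added by addmargins. The object name with_sums is just an illustrative choice:

```r
m <- as.table(matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
                     dimnames = list(g1 = c("yes", "no"),
                                     g2 = c("low", "med", "high"))))

margin.table(m, 1)            # Row sums: 73 (yes) and 47 (no)
rowSums(m)                    # Same values as a plain named vector

with_sums <- addmargins(m)    # 3x4 table with an extra "Sum" row and column
with_sums["Sum", "Sum"]       # Grand total: 120
with_sums["Sum", 1:3]         # The added row holds the column sums
```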

Writing a table to a file

The write.table function writes 1- and 2-dimensional tables. With file="" the output goes to the console. For example:

m = table(g1,g2) # Contingency table
write.table(m, file="", sep="\t", col.names=NA, quote=FALSE) # See ?write.table for the meaning of col.names=NA

To add a name for the rows and columns, set the names of the dimnames attribute and use the ftable and write.ftable functions. For example:

names(attr(m,"dimnames")) = c("Cond1","Cond2")
write.ftable(ftable(m), file="")

Tables of aggregated statistics may need to be rounded and formatted. For example:

m = tapply(x, list(g1,g2), mean) # Table of group means
write.table(round(m,3), file="", sep="\t", col.names=NA, quote=FALSE) # Round each entry to 3dp
write.table(format(round(m,3)), file="", sep="\t", col.names=NA, quote=FALSE) # Format to pad to 3dp
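
When the table is written to a real file instead of the console, it can be read back as a check. This is a minimal sketch that writes to a temporary file. The names f and back are illustrative, and the counts are entered by hand so the example stands alone:

```r
m <- as.table(matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
                     dimnames = list(g1 = c("yes", "no"),
                                     g2 = c("low", "med", "high"))))

f <- tempfile(fileext = ".txt")          # Temporary file path
write.table(m, file = f, sep = "\t", col.names = NA, quote = FALSE)

# col.names=NA writes an empty header cell above the row names,
# so the first column can be read back as row names
back <- read.delim(f, row.names = 1)
back

unlink(f)                                # Remove the temporary file
```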

For 3- or 4-dimensional tables use the ftable and write.ftable functions. For example:

m = ftable(tapply(x, list(g1,g2,g3), mean), row.vars=1:2) # 3-way table of group means
names(attr(m, "row.vars")) = c("Cond1","Cond2") # Add names for the row variables
names(attr(m, "col.vars")) = "Subject" # ...and the column variable
write.ftable(m, file="", digits=3, quote=FALSE) # Pretty-print the table (file="" writes to the console)

More general tables may be structured as an array with 'dimnames' and displayed using ftable. For example:

t1 = tapply(x, list(g1,g2,g3), mean)
t2 = tapply(x, list(g1,g2,g3), sd)
t3 = tapply(x, list(g1,g2,g3), length)
a1 = array(t1, dim=c(2,3,4), dimnames=list(levels(g1),levels(g2),levels(g3)))
a2 = array(c(t1,t2), dim=c(2,3,4,2), dimnames=list(levels(g1),levels(g2),levels(g3),c("Mean","Std Dev")))
a3 = array(c(t1,t2,t3), dim=c(2,3,4,3), dimnames=list(levels(g1),levels(g2),levels(g3),c("Mean","Std Dev","N")))
ftable(a1)
ftable(a2)
ftable(a3)

Chi-squared test

The chi-squared statistic can be used to test for an association between two (or more) categorical variables represented by factors. The test compares the "observed frequencies" at each combination of factor levels with the "expected frequencies" that would arise if the factors were independent. For a two-way table the expected frequency of a cell is its row total multiplied by its column total, divided by the grand total. The null hypothesis is that the factors are independent, so that observed and expected frequencies differ only by sampling variation.

The chisq.test function performs chi-squared contingency table tests and goodness-of-fit tests. Given a 1-D table (based on one factor) it performs a goodness-of-fit test. Given a 2-D table (based on two factors) it performs a chi-squared test for independence of the two factors. The function is designed only for 1-D and 2-D tables. However, the summary method for table objects (returned by table or xtabs) also performs a chi-squared test for independence of factors, and this can handle tables based on more than two factors.

t2 = table(g1,g2) # 2-D contingency table
barplot(t2, beside=TRUE, legend=TRUE, ylim=c(0,40)) # Barplot of the contingency table
chisq.test(t2) # Pearson Chi-Square test
summary(t2) # ...(also performed by the table summary method)
chisq.test(t2)$observed # The observed frequencies, (same as t2)
chisq.test(t2)$expected # The expected frequencies
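
The expected frequencies and the Pearson statistic can also be computed by hand, which makes the calculation explicit. The sketch below uses the same 2x3 counts entered directly. The names expected, x2, df and p are illustrative. No continuity correction is involved because the table is not 2x2:

```r
m <- as.table(matrix(c(21, 16, 19, 12, 33, 19), nrow = 2,
                     dimnames = list(g1 = c("yes", "no"),
                                     g2 = c("low", "med", "high"))))

# Expected frequency under independence: row total * column total / grand total
expected <- outer(rowSums(m), colSums(m)) / sum(m)

# Pearson chi-squared statistic and its degrees of freedom
x2 <- sum((m - expected)^2 / expected)
df <- (nrow(m) - 1) * (ncol(m) - 1)
p  <- pchisq(x2, df, lower.tail = FALSE)

# Compare with chisq.test
res <- chisq.test(m)
all.equal(res$expected, expected, check.attributes = FALSE)
all.equal(unname(res$statistic), x2)
all.equal(res$p.value, p)
```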

The test statistic is only approximately chi-squared distributed, and the approximation becomes inaccurate when expected frequencies are "small". Cochran's rule of thumb states that the approximation is not close enough if: for a 2x2 table, any cell has expected frequency < 5; or for a larger table, any cell has expected frequency < 1, or more than 20% of the cells have expected frequency < 5.
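
The rule of thumb can be turned into a small check on the expected frequencies. The helper below, cochran_ok, is hypothetical (it is not part of base R), and it reads the first case of the rule as applying to 2x2 tables, which is an assumption about the intended meaning:

```r
# Hypothetical helper: TRUE if the table satisfies Cochran's rule of thumb
cochran_ok <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (all(dim(tab) == 2)) {
    # 2x2 table: every expected frequency should be at least 5
    all(expected >= 5)
  } else {
    # Larger table: none below 1, and at most 20% of cells below 5
    all(expected >= 1) && mean(expected < 5) <= 0.20
  }
}

big   <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2)   # The 2x3 counts used above
small <- matrix(c(3, 1, 2, 4), nrow = 2)               # Made-up 2x2 table with few counts

cochran_ok(big)     # TRUE: the smallest expected frequency is about 12
cochran_ok(small)   # FALSE: the expected frequencies are 2 and 3
```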

In the special case of a 2x2 table the chisq.test function applies Yates' continuity correction by default, which attempts to make the Pearson chi-squared statistic more accurate when the expected frequencies are small. However in that case it may be better to use Fisher's exact test instead. This is performed by the fisher.test function.
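
The three approaches can be compared side by side on a 2x2 table. The counts below are made up for illustration, and the object names yates, plain and ft are arbitrary:

```r
# Hypothetical 2x2 table of counts
tab <- matrix(c(12, 5, 7, 9), nrow = 2,
              dimnames = list(g1 = c("yes", "no"), g2 = c("low", "high")))

yates <- chisq.test(tab)                    # Yates' correction applied by default for 2x2
plain <- chisq.test(tab, correct = FALSE)   # Uncorrected Pearson statistic
ft    <- fisher.test(tab)                   # Fisher's exact test

yates$statistic   # The corrected statistic is never larger than the uncorrected one
plain$statistic
ft$p.value        # Exact p-value
ft$estimate       # Odds ratio (conditional maximum likelihood estimate)
```

The odds ratio reported by fisher.test is a conditional estimate, so it will generally differ a little from the simple cross-product ratio of the cell counts.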

The chi-squared statistic is not a correlation coefficient, in the sense that it cannot usefully be squared to produce a measure of effect size for comparison purposes. Measures of the strength of association between categorical variables are provided by the phi function in package psych and by the assocstats function in package vcd, which reports the phi coefficient, the contingency coefficient and Cramér's V. In the 2x2 case of association between dichotomous variables the odds ratio calculated by fisher.test is also useful.
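
One common measure of association, Cramér's V, can be computed directly from the uncorrected Pearson statistic as sqrt(X2 / (n * (min(rows, columns) - 1))). The helper cramers_v below is a hypothetical sketch using only base R, not a function from any package:

```r
# Hypothetical helper: Cramer's V from the uncorrected Pearson chi-squared statistic
cramers_v <- function(tab) {
  x2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n  <- sum(tab)               # Total number of observations
  k  <- min(dim(tab)) - 1      # Smaller table dimension minus one
  sqrt(x2 / (n * k))
}

m <- matrix(c(21, 16, 19, 12, 33, 19), nrow = 2)   # The 2x3 counts used above
v <- cramers_v(m)
v                                                  # Lies between 0 and 1

# A perfectly associated 2x2 table gives the maximum value of 1
perfect <- cramers_v(diag(c(10, 10)))
perfect
```

For a 2x2 table this quantity equals the absolute value of the phi coefficient.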

 

detach(d1) # Clean up
detach(d2)

 

 

 

 

 

